F1 Score (f1_score)#
The F1 score (a.k.a. F-measure) summarizes performance on the positive class by combining:
Precision: when we predict positive, how often are we correct?
Recall: of all actual positives, how many did we find?
It’s especially common when:
the positive class is rare (class imbalance)
false positives and false negatives both matter (roughly equally)
Goals#
Derive the F1 formula from the confusion matrix.
Build a from-scratch NumPy implementation (binary + multiclass averages).
Visualize how thresholding changes precision, recall, and F1.
Use F1 to tune a simple logistic regression classifier.
Quick import#
from sklearn.metrics import f1_score
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score as sk_f1_score
from sklearn.metrics import precision_score as sk_precision_score
from sklearn.metrics import recall_score as sk_recall_score
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(7)
import sklearn
import plotly
print('numpy :', np.__version__)
print('sklearn:', sklearn.__version__)
print('plotly:', plotly.__version__)
numpy : 1.26.2
sklearn: 1.6.0
plotly: 6.5.2
1) Confusion matrix → precision, recall#
Assume binary classification:
true label: \(y \in \{0, 1\}\) (1 = positive)
predicted label: \(\hat{y} \in \{0, 1\}\)
The confusion matrix counts:
|  | \(\hat{y}=1\) | \(\hat{y}=0\) |
|---|---|---|
| \(y=1\) | TP | FN |
| \(y=0\) | FP | TN |
From these counts:

\[
\text{Precision} = \frac{TP}{TP + FP},
\qquad
\text{Recall} = \frac{TP}{TP + FN}
\]

Precision asks: how noisy are our positive predictions?
Recall asks: how many of the actual positives did we miss?
# A tiny example
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0])
tp = np.sum((y_true == 1) & (y_pred == 1))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
tn = np.sum((y_true == 0) & (y_pred == 0))
tp, fp, fn, tn
(2, 1, 1, 4)
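From these counts, precision and recall follow directly (a quick check of the formulas above):
# Precision and recall from the tiny example's counts
precision = tp / (tp + fp)   # 2/3: of the 3 predicted positives, 2 are correct
recall = tp / (tp + fn)      # 2/3: of the 3 actual positives, we found 2
precision, recall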
2) The F1 score#
The F1 score is the harmonic mean of precision and recall:

\[
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]

Substituting the confusion-matrix definitions gives a very useful form:

\[
F_1 = \frac{2\,TP}{2\,TP + FP + FN}
\]
Key intuition:
Harmonic mean punishes imbalance: if precision is high but recall is near zero (or vice versa), \(F_1\) is near zero (see the quick numeric check below).
True negatives do not appear in the formula. That’s great when negatives are abundant (imbalance), but it can also hide poor performance on the negative class.
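As a quick numeric check of the first point (arbitrary example values, not tied to any dataset in this notebook):
# Harmonic vs arithmetic mean for an imbalanced (precision, recall) pair
p_ex, r_ex = 0.9, 0.1
print('F1 (harmonic mean):', 2 * p_ex * r_ex / (p_ex + r_ex))   # 0.18
print('arithmetic mean   :', 0.5 * (p_ex + r_ex))               # 0.50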
A generalization is the \(F_\beta\) score:

\[
F_\beta = (1 + \beta^2)\,\frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}
\]
\(\beta>1\) emphasizes recall
\(\beta<1\) emphasizes precision
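To make the \(\beta\) weighting concrete, here is a small sketch (the helper name `fbeta_from_counts` and the toy labels are just for illustration) that computes \(F_\beta\) from counts and checks it against `sklearn.metrics.fbeta_score`:
# F-beta from confusion-matrix counts, checked against sklearn (illustrative)
from sklearn.metrics import fbeta_score

def fbeta_from_counts(tp, fp, fn, beta):
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

y_true_fb = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred_fb = np.array([1, 1, 0, 1, 1, 1, 0, 0])   # tp=2, fp=3, fn=1
for beta in [0.5, 1.0, 2.0]:
    ours = fbeta_from_counts(2, 3, 1, beta)
    sk = fbeta_score(y_true_fb, y_pred_fb, beta=beta)
    print(f"beta={beta}: ours={ours:.4f}  sklearn={sk:.4f}")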
# Harmonic mean vs arithmetic mean
ps = np.linspace(0.001, 0.999, 400)
r_fixed = 0.2
f1 = 2 * ps * r_fixed / (ps + r_fixed)
am = 0.5 * (ps + r_fixed)
fig = go.Figure()
fig.add_trace(go.Scatter(x=ps, y=f1, mode='lines', name='F1 (harmonic mean)'))
fig.add_trace(go.Scatter(x=ps, y=am, mode='lines', name='Arithmetic mean', line=dict(dash='dash')))
fig.update_layout(
title='Same recall, changing precision: harmonic vs arithmetic mean',
xaxis_title='Precision',
yaxis_title='Score',
legend=dict(orientation='h', yanchor='bottom', y=1.02, xanchor='left', x=0),
)
fig.show()
# F1 as a function of precision and recall
precision_grid = np.linspace(0, 1, 201)
recall_grid = np.linspace(0, 1, 201)
P, R = np.meshgrid(precision_grid, recall_grid)
den = P + R
F1 = np.zeros_like(den, dtype=float)
np.divide(2 * P * R, den, out=F1, where=den != 0)
fig = px.imshow(
F1,
x=precision_grid,
y=recall_grid,
origin='lower',
aspect='auto',
labels={'x': 'Precision', 'y': 'Recall', 'color': 'F1'},
title='F1 surface (heatmap) over precision/recall',
)
fig.update_layout(coloraxis_colorbar=dict(tickformat='.2f'))
fig.show()
3) NumPy implementation (from scratch)#
Below is a minimal implementation that mirrors the default behavior of `sklearn.metrics.f1_score`:
binary F1 via confusion-matrix counts
safe handling of zero division (when there are no predicted positives or no actual positives)
multiclass averages:
`macro`, `micro`, `weighted`
Convention: when a denominator is zero, we return `zero_division` (default `0.0`).
def _as_1d(a):
a = np.asarray(a)
return a.ravel()
def _safe_divide(num, den, zero_division=0.0):
num = np.asarray(num, dtype=float)
den = np.asarray(den, dtype=float)
out = np.full(np.broadcast(num, den).shape, float(zero_division), dtype=float)
np.divide(num, den, out=out, where=den != 0)
return out
def confusion_counts_binary(y_true, y_pred, *, pos_label=1):
y_true = _as_1d(y_true)
y_pred = _as_1d(y_pred)
if y_true.shape != y_pred.shape:
raise ValueError(f"shape mismatch: y_true{y_true.shape} vs y_pred{y_pred.shape}")
yt = y_true == pos_label
yp = y_pred == pos_label
tp = np.sum(yt & yp)
fp = np.sum(~yt & yp)
fn = np.sum(yt & ~yp)
tn = np.sum(~yt & ~yp)
return tp, fp, fn, tn
def precision_recall_f1_from_counts(tp, fp, fn, *, zero_division=0.0):
precision = _safe_divide(tp, tp + fp, zero_division=zero_division)
recall = _safe_divide(tp, tp + fn, zero_division=zero_division)
f1 = _safe_divide(2 * tp, 2 * tp + fp + fn, zero_division=zero_division)
return precision, recall, f1
def f1_score_binary(y_true, y_pred, *, pos_label=1, zero_division=0.0):
tp, fp, fn, tn = confusion_counts_binary(y_true, y_pred, pos_label=pos_label)
_, _, f1 = precision_recall_f1_from_counts(tp, fp, fn, zero_division=zero_division)
return float(f1)
def f1_score_multiclass(y_true, y_pred, *, labels=None, average='macro', zero_division=0.0):
'''Multiclass/single-label F1 via one-vs-rest counts.
average: {'macro','micro','weighted', None}
'''
y_true = _as_1d(y_true)
y_pred = _as_1d(y_pred)
if y_true.shape != y_pred.shape:
raise ValueError(f"shape mismatch: y_true{y_true.shape} vs y_pred{y_pred.shape}")
if labels is None:
labels = np.unique(np.concatenate([y_true, y_pred]))
labels = np.asarray(labels)
tps = []
fps = []
fns = []
supports = []
for lab in labels:
tp = np.sum((y_true == lab) & (y_pred == lab))
fp = np.sum((y_true != lab) & (y_pred == lab))
fn = np.sum((y_true == lab) & (y_pred != lab))
tps.append(tp)
fps.append(fp)
fns.append(fn)
supports.append(np.sum(y_true == lab))
tps = np.asarray(tps)
fps = np.asarray(fps)
fns = np.asarray(fns)
supports = np.asarray(supports)
per_class_f1 = _safe_divide(2 * tps, 2 * tps + fps + fns, zero_division=zero_division)
if average is None:
return labels, per_class_f1
average = str(average).lower()
if average == 'macro':
return float(np.mean(per_class_f1))
if average == 'weighted':
w = _safe_divide(supports, supports.sum(), zero_division=0.0)
return float(np.sum(w * per_class_f1))
if average == 'micro':
tp = tps.sum()
fp = fps.sum()
fn = fns.sum()
return float(_safe_divide(2 * tp, 2 * tp + fp + fn, zero_division=zero_division))
raise ValueError("average must be one of: 'macro', 'micro', 'weighted', None")
# Quick sanity checks vs sklearn
y_true = rng.integers(0, 2, size=200)
y_pred = rng.integers(0, 2, size=200)
ours = f1_score_binary(y_true, y_pred)
sk = sk_f1_score(y_true, y_pred, zero_division=0)
print('binary f1: ours=', ours, 'sklearn=', sk)
y_true_mc = rng.integers(0, 3, size=300)
y_pred_mc = rng.integers(0, 3, size=300)
for avg in ['macro', 'micro', 'weighted']:
ours = f1_score_multiclass(y_true_mc, y_pred_mc, average=avg)
sk = sk_f1_score(y_true_mc, y_pred_mc, average=avg, zero_division=0)
print(f"multiclass {avg:8s}: ours={ours:.6f} sklearn={sk:.6f}")
binary f1: ours= 0.5539906103286385 sklearn= 0.5539906103286385
multiclass macro : ours=0.335908 sklearn=0.335908
multiclass micro : ours=0.336667 sklearn=0.336667
multiclass weighted: ours=0.336466 sklearn=0.336466
4) Thresholding: why F1 depends on the decision rule#
Many classifiers output a score or a probability \(\hat{p}(y=1\mid x)\).
To produce hard labels we pick a threshold \(t\):

\[
\hat{y} = \mathbb{1}\!\left[\hat{p}(y=1 \mid x) \ge t\right]
\]
Changing \(t\) changes FP/FN, therefore precision/recall, therefore F1.
A common way to use F1 for optimization is to choose \(t\) (and other hyperparameters) to maximize validation-set F1:

\[
t^{*} = \arg\max_{t}\; F_1\!\left(y_{\text{val}},\, \mathbb{1}[\hat{p}_{\text{val}} \ge t]\right)
\]
This is practical because:
F1 is not differentiable in the model parameters (it jumps when a single point crosses the threshold)
but it’s easy to optimize over a 1D threshold via a grid search
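A minimal sketch of that 1D grid search (illustrative; the vectorized helper `precision_recall_f1_at_thresholds` defined later in this notebook does the same work without the Python loop):
# Minimal 1D threshold grid search that maximizes F1 (illustrative sketch)
def best_f1_threshold(y_true, y_score, thresholds):
    best_t, best_f1 = float(thresholds[0]), -1.0
    for t in thresholds:
        y_hat = (y_score >= t).astype(int)
        f1 = sk_f1_score(y_true, y_hat, zero_division=0)
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1
# usage (once validation scores exist): best_f1_threshold(y_val, p_val, np.linspace(0, 1, 401))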
# Synthetic imbalanced dataset (2D for visualization)
X, y = make_classification(
n_samples=2500,
n_features=2,
n_informative=2,
n_redundant=0,
n_clusters_per_class=1,
weights=[0.9, 0.1],
class_sep=1.4,
random_state=7,
)
X_train, X_tmp, y_train, y_tmp = train_test_split(
X, y, test_size=0.4, stratify=y, random_state=7
)
X_val, X_test, y_val, y_test = train_test_split(
X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=7
)
# Standardize using train statistics (low-level)
mean_ = X_train.mean(axis=0)
std_ = X_train.std(axis=0)
std_ = np.where(std_ == 0, 1.0, std_)
X_train_s = (X_train - mean_) / std_
X_val_s = (X_val - mean_) / std_
X_test_s = (X_test - mean_) / std_
fig = px.scatter(
x=X_train_s[:, 0],
y=X_train_s[:, 1],
color=y_train.astype(str),
opacity=0.7,
labels={'x': 'x1 (standardized)', 'y': 'x2 (standardized)', 'color': 'class'},
title='Training data (imbalanced)',
)
fig.show()
print('class balance (train):', np.bincount(y_train) / y_train.size)
class balance (train): [0.8973 0.1027]
def add_intercept(X: np.ndarray) -> np.ndarray:
X = np.asarray(X, dtype=float)
return np.c_[np.ones((X.shape[0], 1)), X]
def sigmoid(z):
z = np.asarray(z, dtype=float)
out = np.empty_like(z)
pos = z >= 0
out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
ez = np.exp(z[~pos])
out[~pos] = ez / (1.0 + ez)
return out
def log_loss_from_proba(y_true, p, eps=1e-15):
y_true = np.asarray(y_true, dtype=float)
p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
def fit_logistic_regression_gd(
X,
y,
*,
lr=0.2,
max_iter=2000,
alpha=0.0,
tol=1e-8,
):
'''Binary logistic regression with gradient descent + optional L2 penalty.'''
Xb = add_intercept(X)
y = np.asarray(y, dtype=float).ravel()
n, d = Xb.shape
w = np.zeros(d)
history = []
for _ in range(max_iter):
p = sigmoid(Xb @ w)
loss = log_loss_from_proba(y, p) + 0.5 * alpha * np.sum(w[1:] ** 2)
history.append(loss)
grad = (Xb.T @ (p - y)) / n
grad[1:] += alpha * w[1:]
w_new = w - lr * grad
if np.linalg.norm(w_new - w) < tol:
w = w_new
break
w = w_new
return w, np.asarray(history)
def predict_proba_logreg(X, w):
Xb = add_intercept(X)
return sigmoid(Xb @ w)
w, loss_hist = fit_logistic_regression_gd(X_train_s, y_train, lr=0.2, max_iter=3000, alpha=0.05)
fig = go.Figure()
fig.add_trace(go.Scatter(y=loss_hist, mode='lines', name='train log-loss'))
fig.update_layout(title='Training curve (log-loss)', xaxis_title='Iteration', yaxis_title='Log-loss')
fig.show()
w
array([-3.3621, -1.0781, 0.8771])
def precision_recall_f1_at_thresholds(y_true, y_score, thresholds, *, zero_division=0.0):
y_true = np.asarray(y_true).astype(int).ravel()
y_score = np.asarray(y_score, dtype=float).ravel()
thresholds = np.asarray(thresholds, dtype=float)
y_true_pos = y_true == 1
pred_pos = y_score[:, None] >= thresholds[None, :]
tp = np.sum(pred_pos & y_true_pos[:, None], axis=0)
fp = np.sum(pred_pos & ~y_true_pos[:, None], axis=0)
fn = np.sum(~pred_pos & y_true_pos[:, None], axis=0)
precision = _safe_divide(tp, tp + fp, zero_division=zero_division)
recall = _safe_divide(tp, tp + fn, zero_division=zero_division)
f1 = _safe_divide(2 * tp, 2 * tp + fp + fn, zero_division=zero_division)
return precision, recall, f1, tp, fp, fn
p_val = predict_proba_logreg(X_val_s, w)
thresholds = np.linspace(0.0, 1.0, 401)
prec_t, rec_t, f1_t, tp_t, fp_t, fn_t = precision_recall_f1_at_thresholds(
y_val, p_val, thresholds, zero_division=0.0
)
best_idx = int(np.argmax(f1_t))
t_best = float(thresholds[best_idx])
print('best threshold (val):', t_best)
print('F1 at best threshold (val):', float(f1_t[best_idx]))
best threshold (val): 0.34500000000000003
F1 at best threshold (val): 0.9702970297029703
fig = go.Figure()
fig.add_trace(go.Scatter(x=thresholds, y=prec_t, mode='lines', name='precision'))
fig.add_trace(go.Scatter(x=thresholds, y=rec_t, mode='lines', name='recall'))
fig.add_trace(go.Scatter(x=thresholds, y=f1_t, mode='lines', name='F1', line=dict(width=3)))
fig.add_vline(x=t_best, line_width=2, line_dash='dash', line_color='black')
fig.update_layout(
title='Precision / Recall / F1 vs threshold (validation set)',
xaxis_title='Threshold t',
yaxis_title='Score',
legend=dict(orientation='h', yanchor='bottom', y=1.02, xanchor='left', x=0),
)
fig.show()
def confusion_matrix_from_threshold(y_true, y_score, t):
y_pred = (np.asarray(y_score) >= t).astype(int)
tp, fp, fn, tn = confusion_counts_binary(y_true, y_pred, pos_label=1)
mat = np.array([[tn, fp], [fn, tp]])
return mat, (tp, fp, fn, tn)
mat_05, counts_05 = confusion_matrix_from_threshold(y_val, p_val, 0.5)
mat_best, counts_best = confusion_matrix_from_threshold(y_val, p_val, t_best)
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=(
f't=0.50 (F1={f1_score_binary(y_val, (p_val>=0.5).astype(int)):.3f})',
f't={t_best:.2f} (F1={f1_score_binary(y_val, (p_val>=t_best).astype(int)):.3f})',
),
)
for col, mat in enumerate([mat_05, mat_best], start=1):
fig.add_trace(
go.Heatmap(
z=mat,
x=['Pred 0', 'Pred 1'],
y=['True 0', 'True 1'],
text=mat,
texttemplate='%{text}',
colorscale='Blues',
showscale=False,
),
row=1,
col=col,
)
fig.update_layout(title='Confusion matrices on validation set')
fig.show()
counts_05, counts_best
((44, 0, 7, 449), (49, 1, 2, 448))
# Precision-Recall curve with iso-F1 lines
# (each point corresponds to one threshold)
fig = go.Figure()
fig.add_trace(go.Scatter(x=rec_t, y=prec_t, mode='lines', name='PR curve'))
fig.add_trace(
go.Scatter(
x=[rec_t[best_idx]],
y=[prec_t[best_idx]],
mode='markers',
marker=dict(size=10, color='red'),
name=f'Best F1 (t={t_best:.2f})',
)
)
f_levels = [0.2, 0.4, 0.6, 0.8]
p_line = np.linspace(0.001, 1.0, 400)
for f in f_levels:
mask = p_line > (f / 2)
p = p_line[mask]
r = (f * p) / (2 * p - f)
r = np.clip(r, 0, 1)
fig.add_trace(
go.Scatter(
x=r,
y=p,
mode='lines',
line=dict(dash='dot', width=1),
name=f'F1={f}',
hoverinfo='skip',
)
)
fig.update_layout(
title='Precision–Recall curve (validation) with iso-F1 lines',
xaxis_title='Recall',
yaxis_title='Precision',
xaxis=dict(range=[0, 1]),
yaxis=dict(range=[0, 1]),
)
fig.show()
# How the threshold changes the *linear* decision boundary
# p = sigmoid(z) >= t <=> z >= log(t/(1-t))
def boundary_line(w, t, x1):
z_thr = np.log(t / (1 - t))
if np.isclose(w[2], 0.0):
return None
x2 = (z_thr - w[0] - w[1] * x1) / w[2]
return x2
x1 = np.linspace(X_train_s[:, 0].min() - 0.5, X_train_s[:, 0].max() + 0.5, 200)
x2_05 = boundary_line(w, 0.5, x1)
x2_best = boundary_line(w, t_best, x1)
fig = px.scatter(
x=X_train_s[:, 0],
y=X_train_s[:, 1],
color=y_train.astype(str),
opacity=0.6,
labels={'x': 'x1 (standardized)', 'y': 'x2 (standardized)', 'color': 'class'},
title='Logistic regression: threshold shifts the decision boundary',
)
if x2_05 is not None:
fig.add_trace(go.Scatter(x=x1, y=x2_05, mode='lines', name='t=0.50', line=dict(color='black')))
if x2_best is not None:
fig.add_trace(go.Scatter(x=x1, y=x2_best, mode='lines', name=f't={t_best:.2f}', line=dict(color='red')))
fig.show()
Evaluate on the test set#
We picked \(t^*\) on the validation set to avoid overfitting the threshold.
Now compare:
default \(t=0.5\)
tuned \(t=t^*\)
p_test = predict_proba_logreg(X_test_s, w)
def report_binary(y_true, p, t):
y_hat = (p >= t).astype(int)
tp, fp, fn, tn = confusion_counts_binary(y_true, y_hat)
prec, rec, f1 = precision_recall_f1_from_counts(tp, fp, fn)
return {
'threshold': float(t),
'precision': float(prec),
'recall': float(rec),
'f1': float(f1),
'tp': int(tp),
'fp': int(fp),
'fn': int(fn),
'tn': int(tn),
}
rep_05 = report_binary(y_test, p_test, 0.5)
rep_best = report_binary(y_test, p_test, t_best)
rep_05, rep_best
({'threshold': 0.5,
'precision': 1.0,
'recall': 0.7884615384615384,
'f1': 0.8817204301075269,
'tp': 41,
'fp': 0,
'fn': 11,
'tn': 448},
{'threshold': 0.34500000000000003,
'precision': 0.9791666666666666,
'recall': 0.9038461538461539,
'f1': 0.94,
'tp': 47,
'fp': 1,
'fn': 5,
'tn': 447})
5) Using F1 for model selection (simple “optimization” loop)#
F1 is typically used as a selection criterion rather than a differentiable training loss.
Example: tune the L2 strength \(\alpha\) for logistic regression as follows:
fit the model for each \(\alpha\)
pick the threshold \(t\) that maximizes validation F1
choose the best \((\alpha, t)\) pair
alphas = [0.0, 0.01, 0.05, 0.2, 1.0]
thresholds = np.linspace(0.0, 1.0, 401)
results = []
for a in alphas:
w_a, _ = fit_logistic_regression_gd(X_train_s, y_train, lr=0.2, max_iter=3000, alpha=a)
p_val_a = predict_proba_logreg(X_val_s, w_a)
_, _, f1_a, _, _, _ = precision_recall_f1_at_thresholds(y_val, p_val_a, thresholds)
best_idx_a = int(np.argmax(f1_a))
results.append(
{
'alpha': float(a),
't_best': float(thresholds[best_idx_a]),
'f1_val_best': float(f1_a[best_idx_a]),
}
)
results
[{'alpha': 0.0, 't_best': 0.4875, 'f1_val_best': 0.9702970297029703},
{'alpha': 0.01, 't_best': 0.4175, 'f1_val_best': 0.9702970297029703},
{'alpha': 0.05,
't_best': 0.34500000000000003,
'f1_val_best': 0.9702970297029703},
{'alpha': 0.2, 't_best': 0.2575, 'f1_val_best': 0.9702970297029703},
{'alpha': 1.0, 't_best': 0.155, 'f1_val_best': 0.9702970297029703}]
alpha_vals = np.array([r['alpha'] for r in results])
f1_vals = np.array([r['f1_val_best'] for r in results])
best = results[int(np.argmax(f1_vals))]
fig = go.Figure()
fig.add_trace(go.Scatter(x=alpha_vals, y=f1_vals, mode='lines+markers', name='best val F1'))
fig.update_layout(
title='Validation F1 after threshold tuning vs L2 strength',
xaxis_title='alpha (L2 strength)',
yaxis_title='best validation F1',
)
fig.show()
best
{'alpha': 0.0, 't_best': 0.4875, 'f1_val_best': 0.9702970297029703}
6) Multiclass F1: macro vs micro vs weighted#
For multiclass single-label classification, F1 is usually computed by turning each class into a one-vs-rest problem.
macro: average F1 across classes (treat each class equally)
weighted: average F1 across classes weighted by class support
micro: compute global TP/FP/FN across classes before computing F1
Note: in single-label multiclass classification, micro F1 equals accuracy (a quick check follows the example below).
y_true_mc = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred_mc = np.array([0, 2, 0, 1, 0, 2, 2, 1, 2])
labels, per_class = f1_score_multiclass(y_true_mc, y_pred_mc, average=None)
print('labels:', labels)
print('per-class F1:', per_class)
for avg in ['macro', 'micro', 'weighted']:
ours = f1_score_multiclass(y_true_mc, y_pred_mc, average=avg)
sk = sk_f1_score(y_true_mc, y_pred_mc, average=avg, zero_division=0)
print(f"{avg:8s}: ours={ours:.6f} sklearn={sk:.6f}")
labels: [0 1 2]
per-class F1: [0.6667 0.5 0.75 ]
macro : ours=0.638889 sklearn=0.638889
micro : ours=0.666667 sklearn=0.666667
weighted: ours=0.666667 sklearn=0.666667
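As a quick check of the note above, micro-averaged F1 coincides with plain accuracy on this single-label example (both are \(6/9 \approx 0.667\) here):
# micro F1 vs accuracy on the single-label multiclass example above
accuracy = float(np.mean(y_true_mc == y_pred_mc))
micro_f1 = f1_score_multiclass(y_true_mc, y_pred_mc, average='micro')
print('accuracy:', accuracy)
print('micro F1:', micro_f1)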
Pros / cons and when to use F1#
Pros
Good default when the positive class is rare and you care about both FP and FN.
Single number that summarizes the precision–recall tradeoff.
Common in information retrieval, detection tasks, and many imbalanced classification settings.
Cons / limitations
Ignores true negatives: can be misleading if performance on the negative class matters.
Threshold-dependent: you must pick a threshold (or compare across thresholds).
Not a proper scoring rule (unlike log-loss / Brier), so it’s not ideal for probability calibration.
Not differentiable in model parameters → usually not used as a direct training loss.
Can hide tradeoffs: the same F1 can come from very different (precision, recall) pairs.
Good use cases
Highly imbalanced binary classification where the negative class is huge (fraud, churn, defect detection).
Search / ranking systems after choosing an operating point.
Segmentation / detection tasks (F1 is closely related to the Dice coefficient).
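On binary masks, the Dice coefficient \(2\,|A \cap B| / (|A| + |B|)\) equals the binary F1 of the flattened masks; a small sketch with made-up masks:
# Dice coefficient on binary masks equals binary F1 on the flattened arrays
mask_true = np.array([[1, 1, 0, 0],
                      [1, 0, 0, 0],
                      [0, 0, 0, 1]])
mask_pred = np.array([[1, 0, 0, 0],
                      [1, 1, 0, 0],
                      [0, 0, 0, 1]])
intersection = np.sum((mask_true == 1) & (mask_pred == 1))
dice = 2 * intersection / (mask_true.sum() + mask_pred.sum())
print('Dice:', dice)                                   # 0.75
print('F1  :', f1_score_binary(mask_true, mask_pred))  # 0.75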
Common pitfalls + diagnostics#
Undefined divisions: if the model predicts no positives, precision is undefined. Decide a policy (`zero_division=0` is common).
Wrong averaging in multiclass: `macro` emphasizes minority classes; `weighted` tracks the overall class distribution.
Class imbalance doesn't magically disappear: F1 helps compared to accuracy, but you still need proper validation and often threshold tuning.
If you need to compare models as rankers, prefer PR curves / average precision instead of a single F1 at one threshold (a short sketch follows this list).
If FP and FN have different costs, prefer \(F_\beta\) or an explicit cost-sensitive metric.
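For the ranker-comparison point above, `sklearn.metrics.average_precision_score` gives one threshold-free summary of the precision–recall curve; a minimal sketch on the validation scores from this notebook:
# Threshold-free ranking summary: average precision on the validation scores
from sklearn.metrics import average_precision_score
print('average precision (val):', average_precision_score(y_val, p_val))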
Exercises#
Implement \(F_\beta\) in NumPy and verify it against `sklearn.metrics.fbeta_score`.
For the logistic regression example, compare the threshold that maximizes F1 vs the threshold that maximizes accuracy.
Create an extremely imbalanced dataset (e.g. 99.5% negatives) and compare accuracy vs F1.
For multiclass, create a dataset with one rare class and compare `macro` vs `weighted` F1.
References#
scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
scikit-learn user guide (precision/recall/F-score): https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-f-measure-metrics